Objectives:
- 60 minutes of hands-on
- Statistical Modelling Part
- R for data analyses
- Experimental Design: Balanced design, unbalanced and Split Plot
- Blueberry case
Extra information
- About the author
- About the UF/IFAS Plant Breeding
Introduction
This talk aims to be a friendly introduction to data analysis in R. This hands-on session was primarily created based on a preliminary survey, in which most registrants expressed interest in working with data in R to analyze experiments, with a focus on experimental design.
Below are some relevant questions raised by participants, which I have grouped together and will address in blocks over this short course:
Introduction to R
- When you get your data analysis, what are the first steps you should take?
- What are some good resources for advancing my skills in R or other data analysis?
Experimental Design
- How to control heterogeneity in the field? (reduce noise)
- How can I use data analysis tools to make my research data more meaningful and presentable?
Case study
- How can I make the most out of my data set with discrepancies/inconsistencies?
- What has been your passion and drive for the past years and advice for the younger ones?
Statistical Modelling Process
Below, a summary of the main steps:
- Understand the problem and formulate your biological questions
- Plan and collect the data (experimental design)
- Explore the data (data visualization) (we need R !)
- Postulate a model (experimental design)
- Fit a model (we need R !)
- Check the model (we need R!)
- Go back to your model and try again
- Use the model (we need R!)
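As a rough illustration, the explore/fit/check/use loop above might look like the following in base R (the data, treatment labels, and effect sizes here are all made up for illustration):

```r
# Sketch of the modelling loop on hypothetical data (base R only)
set.seed(123)
toy <- data.frame(
  treatment = factor(rep(1:4, each = 5)),
  response  = rnorm(20, mean = rep(c(10, 12, 11, 13), each = 5))
)
boxplot(response ~ treatment, data = toy)    # explore the data
fit <- lm(response ~ treatment, data = toy)  # postulate and fit a model
anova(fit)                                   # check: is there a treatment effect?
# plot(fit)                                  # check: residual diagnostics
predict(fit, newdata = data.frame(treatment = factor(2, levels = 1:4)))  # use the model
```

If the model check fails (e.g., non-constant residual variance), we go back, adjust the model, and try again.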
R for data analyses
What is R?
- Programming language and software environment specifically designed for statistical computing and data analysis.
- R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the early 1990s.
- Created as an open-source project, that allowed for extensive contributions from users.
Why do we need R?
- Wide Range of Packages
- Rich set of built-in functions and packages for performing a variety of statistical tests
- Data Visualization
- R has a large and active community of users and developers
- R encourages a reproducible workflow, which is crucial for maintaining transparency and ensuring that your analysis can be easily replicated
- R is designed to integrate with other programming languages and tools
Where should I start?
Nowadays you can find materials that fit your learning preference. If you prefer books, The Art of R Programming and the materials from Prof. Andy are good options. If you prefer video classes, you can find good courses on YouTube and Coursera. Finally, if you prefer interactive modes, there are good options on DataCamp and DataQuest, as well as the swirl R package.
What R packages should I use for statistics?
- It is not an easy question !
- Data manipulation: dplyr
- Data visualization: ggplot2
- Linear models: lm
- Mixed models: nlme or lme4
- More complex mixed models: asreml-R (not free)
Good resources for analyses in agriculture
- agricolae: R package for statistical procedures in agricultural research
- DSFAIR
- metan: R package for GxE and biometric models
Experimental Design 101
Motivation
- Which experimental design best suits my goals?
- How can I fit a proper statistical model to my data?
- What sources of variation do I need to consider?
- To answer such questions we need experimentation
1) Replication
- Definition: replication is the application of the same treatment to two or more experimental units in the same experiment.
- Reason:
- Increasing the precision of estimated means
- Increasing the power of statistical tests
- Providing an estimate of experimental error
Some important considerations:
- Trade-off between number of treatments and number of replicates. Example: experiments to estimate the genetic variance of a population versus experiments for cultivar comparison.
- Experiments are designed to answer the question that underlies the importance of estimating the experimental error: are observed differences due to real treatment differences or to variation between experimental units?
- True replication (the experimental unit is replicated) vs. technical replication (the experimental unit is subsampled)
2) Local Control
- Definition: Local control (or error control) is the use of experimental designs or techniques to reduce experimental error
- Reason:
- It prevents the detrimental effects of environmental variations within the experimental area
- If homogeneity is a clear issue, the experiment can be subdivided to increase efficiency
- Blocking is a common alternative
3) Randomization
Definition: Randomization is the random assignment of experimental units to treatments, such that all treatments have the same chance of being assigned to a given experimental unit
Reason:
- Providing unbiased estimates of error and treatment means
- Accounting for extraneous confounding variables
- Simply stating: makes an experiment valid
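A minimal base-R sketch of what randomization means in practice, here for a completely randomized assignment (the treatment labels are hypothetical):

```r
# Random assignment of treatments to experimental units (CRD sketch)
set.seed(42)
treatments <- rep(paste0("T", 1:5), each = 3)  # 5 treatments, 3 replicates each
layout <- sample(treatments)                   # unrestricted randomization over 15 units
table(layout)                                  # every treatment still appears 3 times
```

Each experimental unit had the same chance of receiving any treatment, which is what makes the resulting error estimate unbiased.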
Balanced ANOVA Case
Motivation
- Let’s say that we want to make inferences about a given set of treatments
- We can set up an experiment trial to compare the effects of t different treatments
What layout should I use?
- Layout 1: Randomized Complete Block Design (RCBD). Randomization is constrained to control for unwanted variation.
- Layout 2: Completely Randomized Design (CRD). Randomization is completely unrestricted across treatments. Since there is no local (error) control, the CRD is appropriate only when experimental units are uniform.
Statistical Analyses
- Statistical Model:
\[y_{ij} = \mu + \tau_i + \beta_j + e_{ij}\] where \(\mu\) is the intercept (an overall mean), \(\tau_i\) is the treatment effect, \(\beta_j\) is the block effect and \(e_{ij}\) is the associated random error term
Example
- Simulated data set
- 3 blocks
- 10 blueberry cultivars
- Response variable: fruit size
- Question: are the treatments different?
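The original .Rdata file is not distributed with this text; a comparable RCBD dataset can be simulated in base R (all effect sizes below are made up):

```r
# Simulate an RCBD: 3 blocks x 10 cultivars, additive block and treatment effects
set.seed(2023)
blocks    <- factor(rep(1:3, each = 10))
cultivars <- factor(rep(1:10, times = 3))
block_eff <- rnorm(3,  sd = 1.0)[blocks]    # hypothetical block effects
trt_eff   <- rnorm(10, sd = 1.0)[cultivars] # hypothetical cultivar effects
sim.rcbd <- data.frame(
  block     = blocks,
  treatment = cultivars,
  response  = 10 + block_eff + trt_eff + rnorm(30, sd = 0.5)
)
str(sim.rcbd)  # 30 obs.: every cultivar appears once in every block
```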
# Data set
load("/home/felipe/Dropbox (UFL)/classes/CORTEVA_handsOn/data_examples.Rdata")
data = simulated.rcbd
str(data)
## 'data.frame': 30 obs. of 5 variables:
## $ block : int 1 1 1 1 1 1 1 1 1 1 ...
## $ treatment: int 1 2 3 4 5 6 7 8 9 10 ...
## $ response : num 11.19 9.5 12.3 9.63 9.08 ...
## $ row : num 1 1 1 1 1 1 1 1 1 1 ...
## $ col : int 4 2 3 9 7 8 10 5 1 6 ...
# Transform into factors
data$block = as.factor(data$block)
data$treatment = as.factor(data$treatment)
# Boxplots
par(mfrow=c(1,2))
boxplot(data$response~data$treatment, main="Treatments")
boxplot(data$response~data$block, main = "Block")
# Explore
library(desplot)
desplot(
data = simulated.rcbd,
flip = TRUE, # row 1 on top, not on bottom
form = response ~ row + col, # fill color according to yield
out1 = block, # line between blocks
text = treatment, # cultivar names per plot
cex = 1, # cultivar names: font size
shorten = FALSE, # cultivar names: don't abbreviate
main = "Berry Size", # plot title
show.key = FALSE # hide legend
)
# ANOVA model
m0 <- lm(response ~ block + treatment, data = data)
anova(m0)
## Analysis of Variance Table
##
## Response: response
## Df Sum Sq Mean Sq F value Pr(>F)
## block 2 12.472 6.2357 6.4269 0.007829 **
## treatment 9 22.311 2.4790 2.5550 0.043136 *
## Residuals 18 17.465 0.9703
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plots
par(mfrow = c(2, 2))
plot(m0)
# Marginal means
library(emmeans)
library(agricolae)
emm <- emmeans(m0, specs = ~treatment)
emm
## treatment emmean SE df lower.CL upper.CL
## 1 11.41 0.569 18 10.21 12.60
## 2 9.49 0.569 18 8.30 10.68
## 3 11.27 0.569 18 10.07 12.46
## 4 10.08 0.569 18 8.89 11.28
## 5 9.49 0.569 18 8.30 10.69
## 6 10.86 0.569 18 9.67 12.06
## 7 8.48 0.569 18 7.29 9.68
## 8 9.66 0.569 18 8.46 10.85
## 9 9.65 0.569 18 8.45 10.84
## 10 9.85 0.569 18 8.66 11.05
##
## Results are averaged over the levels of: block
## Confidence level used: 0.95
# Differences
head(pairs(emm))
## contrast estimate SE df t.ratio p.value
## treatment1 - treatment2 1.920 0.804 18 2.387 0.1576
## treatment1 - treatment3 0.140 0.804 18 0.174 1.0000
## treatment1 - treatment4 1.326 0.804 18 1.649 0.5244
## treatment1 - treatment5 1.917 0.804 18 2.384 0.1585
## treatment1 - treatment6 0.545 0.804 18 0.677 0.9856
## treatment1 - treatment7 2.929 0.804 18 3.642 0.0111
##
## Results are averaged over the levels of: block
## P value adjustment: sidak method for 6 tests
# Tukey test
tk_hsd <- HSD.test(m0, trt = "treatment")
#plot(tk_hsd)
tk_hsd$groups
## response groups
## 1 11.409461 a
## 3 11.269555 ab
## 6 10.864881 ab
## 4 10.083279 ab
## 10 9.851906 ab
## 8 9.656233 ab
## 9 9.649170 ab
## 5 9.492250 ab
## 2 9.489942 ab
## 7 8.480569 b
Additional Questions
- Is blocking necessary? If not, we are wasting degrees of freedom estimating block effects, resulting in lower power for the treatment effects.
- Can we check for the interaction between block and treatment? Only if treatments are replicated within blocks.
- If designed as an RCBD, can I analyze it as a CRD? No!
- Blocks that are too big may not be optimal
- If there is a large number of blocks, then an ANOVA may not be the best approach, since we are using many degrees of freedom to estimate effects for a “nuisance” factor.
Unbalanced ANOVA models
Motivation
- In some situations, it may be unfeasible to obtain enough homogeneous experimental units to construct complete blocks
- Blocks are not large enough to contain one repetition of each treatment
- Example:
- Sensory panels in blueberry
- We have 10 blueberry samples (treatments) that should be tasted by different panelists that will score the overall liking
- Each panelist is a block
- But, a maximum of 5 treatments (berries) can be tested per each panelist
Layout
- There are different kinds of incomplete block designs
- Herein, we will first focus on the balanced incomplete block (BIB) design. It is called balanced because all treatments are tested with the same number of repetitions and every pair of treatments occurs together in the same number of blocks. It is incomplete because not all treatments can be assigned to every block.
- Example presented by Zeviani in this link
- 5 treatments, 10 blocks, block size of 3, 6 repetitions (the number of times each treatment is replicated) and lambda of 3 (the number of times each pair of treatments occurs together in the same incomplete block)
Statistical Analyses
- Statistical Model:
\[y_{ij} = \mu + \tau_i + \beta_j + e_{ij}\] where \(\mu\) is the intercept (an overall mean), \(\tau_i\) is the treatment effect, \(\beta_j\) is the block effect and \(e_{ij}\) is the associated random error term
- Important: not all treatments occur in every block, such that some i, j combinations do not exist
- The treatment sum of squares needs to be adjusted, because not all treatments are present in each block
The agricolae package
Example
- Simulated Sensory Data
- 5 treatments (blueberry genotypes)
- 10 blocks (panelists)
- Block size is 3 (each panelist can taste a maximum of 3 genotypes)
- 6 repetitions (the number of times each treatment is replicated)
- lambda is 3 (the number of times each pair of treatments occurs together in the same incomplete block)
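These parameters are not free to vary independently: in a BIB, the total number of plots satisfies t·r = b·k, and lambda = r(k − 1)/(t − 1). We can verify both relations for this example:

```r
# BIB parameter checks for the sensory example
n_trt <- 5; n_blk <- 10; blk_size <- 3; n_rep <- 6
stopifnot(n_trt * n_rep == n_blk * blk_size)   # 30 plots in total
lambda <- n_rep * (blk_size - 1) / (n_trt - 1) # co-occurrence of treatment pairs
lambda                                         # 3
```

If the numbers do not satisfy these identities, the design is not a valid BIB.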
- Response variable is the overall liking, ranging from 0 to 10
# Example
load("/home/felipe/Dropbox (UFL)/classes/CORTEVA_handsOn/data_examples.Rdata")
head(bib.data)
## block treatment response
## 1 1 3 9.299370
## 2 1 4 6.127458
## 3 1 5 7.081114
## 4 2 1 6.569166
## 5 2 3 5.529899
## 6 2 4 1.940972
# Boxplots
par(mfrow=c(1,2))
boxplot(bib.data$response~bib.data$treatment, main="Treatments")
boxplot(bib.data$response~bib.data$block, main = "Block")
# Connection plot
library(dplyr)
library(ggplot2)
da <- bib.data %>%
do({
k <- nlevels(.$treatment)
a <- seq(0, 2 * pi, length.out = k + 1)[-(k + 1)]
cbind(.,
data.frame(coord_x = sin(a)[as.integer(.$treatment)],
coord_y = cos(a)[as.integer(.$treatment)]))
})
ext <- c(-1.2, 1.2)
ggplot(data = da,
mapping = aes(coord_x, coord_y)) +
facet_wrap(facets = ~block, nrow = 2) +
geom_point() +
geom_polygon(fill = NA, color = "orange") +
geom_label(mapping = aes(label = treatment)) +
expand_limits(x = ext, y = ext) +
coord_equal()
# Model
# Incomplete case (not orthogonal, or estimates are not independent)
anova(lm(response ~ treatment + block, data = bib.data))
## Analysis of Variance Table
##
## Response: response
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 4 48.040 12.0100 12.6240 7.949e-05 ***
## block 9 36.702 4.0780 4.2865 0.005572 **
## Residuals 16 15.222 0.9514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(response ~ block + treatment, data = bib.data))
## Analysis of Variance Table
##
## Response: response
## Df Sum Sq Mean Sq F value Pr(>F)
## block 9 52.069 5.7855 6.0812 0.0009150 ***
## treatment 4 32.673 8.1682 8.5858 0.0006722 ***
## Residuals 16 15.222 0.9514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Complete case
library(agricolae)
# Approximate the standard errors from the Tukey test
BIB.test(block = bib.data$block,
trt = bib.data$treatment,
y = bib.data$response,
test = "tukey",
console = TRUE)
##
## ANALYSIS BIB: bib.data$response
## Class level information
##
## Block: 1 2 3 4 5 6 7 8 9 10
## Trt : 3 4 5 1 2
##
## Number of observations: 30
##
## Analysis of Variance Table
##
## Response: bib.data$response
## Df Sum Sq Mean Sq F value Pr(>F)
## block.unadj 9 52.069 5.7855 6.0812 0.0009150 ***
## trt.adj 4 32.673 8.1682 8.5858 0.0006722 ***
## Residuals 16 15.222 0.9514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## coefficient of variation: 17.8 %
## bib.data$response Means: 5.480534
##
## bib.data$treatment, statistics
##
## bib.data$response mean.adj SE r std Min Max
## 1 6.159150 6.385435 0.4288704 6 1.138812 4.694451 7.476364
## 2 3.771272 4.347151 0.4288704 6 1.589835 0.719765 5.441855
## 3 6.943653 6.923604 0.4288704 6 1.275444 5.529899 9.299370
## 4 4.171056 3.975718 0.4288704 6 1.833893 1.940972 6.450322
## 5 6.357539 5.770762 0.4288704 6 1.253130 4.402893 7.598103
##
## Tukey
## Alpha : 0.05
## Std.err : 0.4362021
## HSD : 1.889927
## Parameters BIB
## Lambda : 3
## treatmeans : 5
## Block size : 3
## Blocks : 10
## Replication: 6
##
## Efficiency factor 0.8333333
##
## <<< Book >>>
##
## Comparison between treatments means
## Difference pvalue sig.
## 1 - 2 2.0382838 0.0314 *
## 1 - 3 -0.5381684 0.9029
## 1 - 4 2.4097170 0.0095 **
## 1 - 5 0.6146735 0.8531
## 2 - 3 -2.5764522 0.0055 **
## 2 - 4 0.3714332 0.9727
## 2 - 5 -1.4236103 0.1929
## 3 - 4 2.9478854 0.0017 **
## 3 - 5 1.1528419 0.3716
## 4 - 5 -1.7950435 0.0668 .
##
## Treatments with the same letter are not significantly different.
##
## bib.data$response groups
## 3 6.923604 a
## 1 6.385435 a
## 5 5.770762 ab
## 2 4.347151 b
## 4 3.975718 b
More than one random effect: Mixed Models
A mixed model is a statistical model containing both fixed effects and random effects. The modelling process follows the same steps presented in this class; however, the decision as to whether a given effect should be treated as fixed or random is not always obvious. The table below contrasts the two.
|  | Fixed | Random |
|---|---|---|
| Objectives | compare specific levels of a certain factor | conclusions drawn for a larger universe of interest |
| Sampling | treatment levels are selected by the investigator | treatment levels are randomly sampled |
| Statistically | hypothesis tests on the means \(H_0: \tau_1 = \tau_2 = ... = \tau_n\) | hypothesis tests on the variance component \(H_0: \sigma^2_{\tau} = 0\) |
| Number of observations | any | rule of thumb: >5-10 levels |
When do we need more than one random effect?
- Unbalanced Data
- Error is not independent (central assumption in ANOVA models)
- More than one residual (split plot design)
- Observational and hierarchical structure
- Correlated measures
Main R packages for Mixed Models
nlme
lme4
asreml-R
sommer
For more details, check this link
Split-Plot Design
- It is common to design experiments to compare combinations of two different treatment factors
- For example: irrigation vs. genotypic performance
- In some cases, it is not feasible to design factorial experiments. Think of the irrigation process!
- Solution: arrange the experiment in two steps. First assign the levels of one factor to whole plots, which are then divided into subplots, to which the levels of the second factor are assigned
- Two (independent) randomizations
- Soil is randomly assigned to whole plots
- Genotypes are randomly assigned to subplots within whole plots
- Kind of two experiments in one: (i) a completely randomized design for soil, which gives a random error term for the whole plot; and (ii) a second randomization for the subplots, which contain the experimental units for each genotype.
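The two independent randomizations can be sketched in base R. The labels below are toy placeholders (the real example later in this section uses fungicide as the whole-plot factor):

```r
# Split-plot randomization in two independent steps
set.seed(7)
soils     <- paste0("Soil", 1:2)
genotypes <- paste0("G", 1:4)
n_wp      <- 4                                  # number of whole plots
wp_soil <- sample(rep(soils, each = n_wp / 2))  # step 1: soils to whole plots
layout <- do.call(rbind, lapply(seq_len(n_wp), function(wp) {
  data.frame(whole_plot = wp,                   # step 2: genotypes randomized
             soil       = wp_soil[wp],          # independently within each
             gen        = sample(genotypes))    # whole plot
}))
head(layout)
```

Because the soil levels are randomized among whole plots while genotypes are randomized within them, the two factors end up with different experimental errors.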
Model
\[y_{ijk} = \mu + \alpha_i + u_k + \beta_j + \gamma_{ij} + e_{ijk}\] where \(y_{ijk}\) is the response variable evaluated for the ith soil type and jth genotype in the kth whole plot; \(u_k\) is the random effect of the kth whole plot; \(e_{ijk}\) is the random error for the subplot; \(\alpha_i\) is the fixed effect of soil type; \(\beta_j\) is the fixed effect of genotype; and \(\gamma_{ij}\) is the interaction between soil type and genotype.
Example
- R package: agridat
- Barley variety trial with fungicide treatments
- Yield: response variable
- 70 winter barley cultivars
- Two fungicide treatments were applied to the main (whole) plots
# Data set
load("/home/felipe/Dropbox (UFL)/classes/CORTEVA_handsOn/data_examples.Rdata")
str(barley.data)
## 'data.frame': 560 obs. of 5 variables:
## $ block : chr "B1" "B1" "B1" "B1" ...
## $ fung : chr "F1" "F1" "F1" "F1" ...
## $ whole_plot: chr "B1_F1" "B1_F1" "B1_F1" "B1_F1" ...
## $ gen : chr "G54" "G44" "G68" "G59" ...
## $ yield : num 5.89 6.17 5.68 5.85 5.8 6.01 5.89 4.53 5.32 5.36 ...
data = barley.data
data$block = as.factor(data$block)
data$fung = as.factor(data$fung)
data$gen = as.factor(data$gen)
data$whole_plot= as.factor(data$whole_plot)
# Boxplots
par(mfrow=c(1,2))
boxplot(data$yield~data$fung, main="whole plot")
boxplot(data$yield~data$gen, main = "split_plot")
# ANOVA models
ANOVA <- aov(yield ~ block + fung*gen + Error(whole_plot), data)
summary(ANOVA)
##
## Error: whole_plot
## Df Sum Sq Mean Sq F value Pr(>F)
## block 3 15.23 5.08 4.865 0.11321
## fung 1 42.02 42.02 40.272 0.00791 **
## Residuals 3 3.13 1.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## gen 69 39.28 0.5693 7.201 <2e-16 ***
## fung:gen 69 5.09 0.0738 0.933 0.629
## Residuals 414 32.73 0.0791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Mixed Model with equivalent results
library(lme4)
fmixed <- lmer(yield ~ fung*gen + (1 | block/whole_plot), data)
anova(fmixed) # fixed effects
## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## fung 1 3.184 3.1841 40.2718
## gen 69 39.284 0.5693 7.2008
## fung:gen 69 5.090 0.0738 0.9331
summary(fmixed)$varcor # variance components
## Groups Name Std.Dev.
## whole_plot:block (Intercept) 0.11737
## block (Intercept) 0.16972
## Residual 0.28119
plot(fmixed)
# Multiple Comparisons
library(emmeans)
fm_means <- emmeans(fmixed, "gen", by = "fung")
head(pairs(fm_means))
## contrast fung estimate SE df t.ratio p.value
## G01 - G02 F1 -0.138 0.199 414 -0.692 0.9823
## G01 - G03 F1 -1.295 0.199 414 -6.513 <.0001
## G01 - G04 F1 0.935 0.199 414 4.703 <.0001
## G01 - G05 F1 -0.055 0.199 414 -0.277 0.9999
## G01 - G06 F1 0.005 0.199 414 0.025 1.0000
## G01 - G07 F1 -0.180 0.199 414 -0.905 0.9350
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: sidak method for 6 tests
Planning Experimental Design
Layout
- The agricolae package can help in designing large experiments
- It provides the layout considering the number of repetitions, randomization, and local control
library(agricolae)
# trt, trt2, factor_levels and n_reps below are placeholders for your
# treatment vectors and number of replicates
RCBD <- design.rcbd(trt, n_reps) # RCBD
LS <- design.lsd(trt) # Latin square
SPLITPLOT <- design.split(trt, trt2, n_reps, design = "rcbd") # split-plot with block design
BIB <- design.bib(trt, k = 4, r = 4) # incomplete block design
LATTICE <- design.lattice(trt, r = 2) # lattice
FACTORIAL <- design.ab(factor_levels, n_reps) # factorial
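If agricolae is unavailable, the core idea behind design.rcbd — a fresh, independent randomization of all treatments within each block — can be reproduced in base R (the treatment labels below are hypothetical):

```r
# Base-R sketch of an RCBD field book: one permutation of all
# treatments per block
set.seed(42)
trt    <- paste0("T", 1:6)  # 6 hypothetical treatments
n_reps <- 3                 # 3 blocks
book <- do.call(rbind, lapply(seq_len(n_reps), function(b) {
  data.frame(block = b, plot = seq_along(trt), treatment = sample(trt))
}))
head(book)
```

This mirrors the `$book` data frame that the agricolae design functions return.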
Visualization
- desplot R package
- Check this nice cookbook: DSFAIR – Data Science for Agriculture in R
# Example of factorial
library(agricolae)
library(dplyr)
library(desplot)
# 4 soil treatments
TrtSoil <- paste0("S", 1:4) # S1 - S4
n_TrtSoil <- n_distinct(TrtSoil) # 4
TrtSoil
## [1] "S1" "S2" "S3" "S4"
# 6 fertilizer
TrtFert <- paste0("N", 1:6) # N1 - N6
n_TrtFert <- n_distinct(TrtFert) # 6
TrtFert
## [1] "N1" "N2" "N3" "N4" "N5" "N6"
# number of replicates (blocks)
n_Reps <- 3
# Factorial
fac2rcbd_out <- design.ab(trt = c(n_TrtFert, n_TrtSoil),
design = "rcbd",
r = n_Reps,
seed = 42)
# Add Row and Col
fac2rcbd_out$bookRowCol <- fac2rcbd_out$book %>%
bind_cols(expand.grid(Row = 1:n_TrtFert,
Col = 1:(n_TrtSoil*n_Reps))) %>%
mutate(TrtFert = paste0("N", A),
TrtSoil = paste0("S", B))
head(fac2rcbd_out$bookRowCol)
## plots block A B Row Col TrtFert TrtSoil
## 1 101 1 2 1 1 1 N2 S1
## 2 102 1 3 1 2 1 N3 S1
## 3 103 1 4 4 3 1 N4 S4
## 4 104 1 1 3 4 1 N1 S3
## 5 105 1 6 1 5 1 N6 S1
## 6 106 1 3 4 6 1 N3 S4
# Plot field layout
desplot(block ~ Col + Row | block, flip = TRUE,
out1 = Row, out1.gpar = list(col = "grey", lty = 1),
out2 = Col, out2.gpar = list(col = "grey", lty = 1),
text = TrtFert, cex = 1, shorten = "no", col=TrtSoil,
data = fac2rcbd_out$bookRowCol,
main = "RCBD",
show.key = TRUE, key.cex = 0.5)
Case Study: experimental design and GWAS
What is a GWAS analysis?
Genome-wide association studies (GWAS) measure hundreds of thousands of genetic variants (typically single nucleotide polymorphisms, or SNPs) across many individuals, with the primary goal of identifying which regions of the genome harbor SNPs that affect some phenotype or outcome of interest. The focus of this tutorial is on GWA analysis of common variants, which involves testing the association of each SNP independently and then characterizing the findings through a variety of visual and analytic tools.
See the example below:
- We have a single chromosome with 10 SNPs
- We have a hypothetical major QTL close to SNP8 and SNP9
- Let’s assume the linkage disequilibrium between the QTL and the markers in this toy example is created by physical linkage and not by any other biological source
- How can I quantify the association between a QTL and SNPs?
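One simple answer is a single-marker scan: regress the phenotype on each SNP independently and inspect the p-values. A toy sketch with simulated genotypes (everything below is made up, including the SNP labels and effect size):

```r
# Toy single-marker GWAS scan: 10 independent SNPs, QTL effect on SNP8
set.seed(1)
n    <- 200
snps <- matrix(rbinom(n * 10, size = 2, prob = 0.3), nrow = n)  # SNPs coded 0/1/2
colnames(snps) <- paste0("SNP", 1:10)
y <- 10 + 1.5 * snps[, "SNP8"] + rnorm(n)  # hypothetical QTL effect at SNP8
# test each SNP independently and keep the slope's p-value
pvals <- apply(snps, 2, function(g) summary(lm(y ~ g))$coefficients[2, 4])
round(-log10(pvals), 1)  # SNP8 should stand out
```

In a real GWAS the model would also account for population structure and relatedness, which is where the phenotypic model discussed next comes in.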
What is the impact on using a proper phenotypic model for GWAS studies?
The problem
- In the blueberry breeding program, fruit quality traits are collected from multiple seasons in different locations
- When conducting cycles of selection, we do not have enough material (and space) available to replicate all genotypes.
- It is more efficient to screen a large number of genotypes than to spend the same resources replicating a few genotypes
- Augmented (or Federer) design:
- Two categories of treatments: checks (or controls, or standards) and regular (new treatments)
- Below, we have 3 blocks, with 4 checks and 33 regular treatments
- Checks should occur once in every block
- Randomization is carried out in steps: first the checks and next the regular treatments
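The stepwise randomization described above can be sketched in base R for the 3-block, 4-check, 33-regular-treatment layout (all labels are hypothetical):

```r
# Augmented design randomization: every check in every block,
# regular treatments split across blocks
set.seed(11)
checks   <- paste0("Ck", 1:4)
regular  <- paste0("T", 1:33)
n_blocks <- 3
reg_split <- split(sample(regular), rep(1:n_blocks, each = 11))  # 11 regulars per block
layout <- lapply(seq_len(n_blocks), function(b) {
  sample(c(checks, reg_split[[b]]))  # checks + this block's regulars, randomized
})
sapply(layout, length)  # 15 plots per block
```

Note that each regular treatment appears only once in the whole trial, while each check appears once per block.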
Pros
- Can be repeated in multiple environments, allowing for the investigation of genotype by environment interaction
- An important alternative when conducting the first cycles of selection, when you may not have enough material available to replicate these genotypes
- Important when doing on-farm research
Cons
- The degrees of freedom for the experimental error come only from the checks
- Less experimental precision
What is a check?
- A check is a set of replicated treatments
- It allows experimental error to be adequately estimated
- Adapted to the testing environment
- Responsive, to provide a better estimate of block-to-block variation
- Well characterized
- Example: cultivars
How to record the data?
- Create a new factor effect (check)
- Create a dummy variable (New) for all entries to flag check vs. regular treatments. It allows us to estimate the genetic variance from the regular treatments and the experimental error from the checks.
| Treatment | response | Check | New | Block |
|---|---|---|---|---|
| Ck1 | 100 | Ck1 | 0 | 1 |
| Ck2 | 105 | Ck2 | 0 | 1 |
| Ck1 | 102 | Ck1 | 0 | 2 |
| Ck2 | 110 | Ck2 | 0 | 2 |
| T1 | 90 | Reg | 1 | 1 |
| T2 | 96 | Reg | 1 | 1 |
| T3 | 120 | Reg | 1 | 2 |
| T4 | 123 | Reg | 1 | 2 |
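The Check and New columns can be derived directly from the treatment labels; a small base-R sketch using the hypothetical entries from the table above:

```r
# Build the Check factor and the New dummy variable from treatment labels
dat <- data.frame(
  Treatment = c("Ck1", "Ck2", "Ck1", "Ck2", "T1", "T2", "T3", "T4"),
  Block     = factor(c(1, 1, 2, 2, 1, 1, 2, 2))
)
is_check  <- grepl("^Ck", dat$Treatment)             # checks share the "Ck" prefix
dat$Check <- ifelse(is_check, dat$Treatment, "Reg")  # checks keep their label
dat$New   <- as.integer(!is_check)                   # 1 = unreplicated regular entry
dat
```

This assumes checks can be recognized by a naming convention; in practice you may need an explicit list of check names instead of the `grepl` pattern.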
Model and R codes
\[y_{ij} = \mu + \tau_i + \beta_j + e_{ij}\]
where \(\mu\) is the overall mean, \(\tau_i\) are treatment effects separated into regular and check effects, \(\beta_j\) are block effects and \(e_{ij}\) are the associated error random variables.
# fixed checks and random entries
fmer <- lmer(response ~ Check + (1 | Block) + (1 | New:Treatment), data)
# asreml-R
M1 <- asreml(fixed = response ~ Block + Check,
             random = ~ New:Treatment,
             data = data)
Blueberry Results
- Scenario 1: ignored the augmented design and checks for the phenotypic analyses
- Scenario 2: used an augmented block design, estimated the eBLUEs and ran a GWAS
Conclusions
- “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)
- “Essentially, all models are wrong, but some are useful” (George E. P. Box)
References
- Casella, G. Statistical Design. (2008)
- DSFAIR – Data Science for Agriculture in R link
- Manual de Planejamento & Analises de Experimentos com R link
- Federer, W. T. Augmented (or hoonuiaku) designs. Hawaiian Planters’ Record 55, 191–208 (1956).
- Federer, W. T. Augmented designs with one-way elimination of heterogeneity. Biometrics 17, 447–473 (1961).
- Federer, W. T. & Raghavarao, D. On Augmented Designs. Biometrics 31, 29–35 (1975).
- Steel, R. G. & Torrie, J. H. Principles and Procedures of Statistics: A Biometrical Approach. 2nd Edition. (1980).